Text Analysis with UMAP#

We are going to use UMAP to embed text, working with the 20 newsgroups dataset: a collection of forum posts labelled by topic. We will embed these documents and see that similar documents (i.e. posts in the same newsgroup) end up close together.

You can use this embedding for downstream tasks, such as visualizing your corpus or running a clustering algorithm on it.

!pip install umap-learn[plot]
import pandas as pd
import umap
import umap.plot

# Used to get the data
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Some plotting libraries
import matplotlib.pyplot as plt
%matplotlib notebook
from bokeh.plotting import show, save, output_notebook, output_file
from bokeh.resources import INLINE
output_notebook(resources=INLINE)
Loading BokehJS ...
dataset = fetch_20newsgroups(subset='all',
                             shuffle=True, random_state=42)
print(f'{len(dataset.data)} documents')
print(f'{len(dataset.target_names)} categories')
18846 documents
20 categories

Here are the categories of documents. As you can see, many are closely related to one another (e.g. 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'), while others have little in common (e.g. 'sci.med' and 'rec.sport.baseball').

dataset.target_names
['alt.atheism',
 'comp.graphics',
 'comp.os.ms-windows.misc',
 'comp.sys.ibm.pc.hardware',
 'comp.sys.mac.hardware',
 'comp.windows.x',
 'misc.forsale',
 'rec.autos',
 'rec.motorcycles',
 'rec.sport.baseball',
 'rec.sport.hockey',
 'sci.crypt',
 'sci.electronics',
 'sci.med',
 'sci.space',
 'soc.religion.christian',
 'talk.politics.guns',
 'talk.politics.mideast',
 'talk.politics.misc',
 'talk.religion.misc']

Let’s check a sample of them:

for idx, document in enumerate(dataset.data[:2]):
    category = dataset.target_names[dataset.target[idx]]

    print(f'Category: {category}')
    print('---------------------------')
    # Print the first 500 characters of the post
    print(document[:500])
    print('---------------------------')
Category: rec.sport.hockey
---------------------------
From: Mamatha Devineni Ratnam <mr47+@andrew.cmu.edu>
Subject: Pens fans reactions
Organization: Post Office, Carnegie Mellon, Pittsburgh, PA
Lines: 12
NNTP-Posting-Host: po4.andrew.cmu.edu



I am sure some bashers of Pens fans are pretty confused about the lack
of any kind of posts about the recent Pens massacre of the Devils. Actually,
I am  bit puzzled too and a bit relieved. However, I am going to put an end
to non-PIttsburghers' relief with a bit of praise for the Pens. Man, they
are killin
---------------------------
Category: comp.sys.ibm.pc.hardware
---------------------------
From: mblawson@midway.ecn.uoknor.edu (Matthew B Lawson)
Subject: Which high-performance VLB video card?
Summary: Seek recommendations for VLB video card
Nntp-Posting-Host: midway.ecn.uoknor.edu
Organization: Engineering Computer Network, University of Oklahoma, Norman, OK, USA
Keywords: orchid, stealth, vlb
Lines: 21

  My brother is in the market for a high-performance video card that supports
VESA local bus with 1-2MB RAM.  Does anyone have suggestions/ideas on:

  - Diamond Stealth Pro Local 
---------------------------

Now we will create a dataframe with the target labels to be used in plotting. With interactive plotting, this lets us see each post's newsgroup when hovering over a point, which helps us evaluate (by eye) how good the embedding looks.

category_labels = [dataset.target_names[x] for x in dataset.target]
hover_df = pd.DataFrame(category_labels, columns=['category'])

Representing text in Machine Learning#

We need to convert the text into a numerical representation. There are many ways to do this; we will use a simple bag-of-words representation: a count of the number of times each word appears in a document, ignoring the order of the words.


We will use sklearn's CountVectorizer to do this for us, along with a couple of other preprocessing steps:

  1. Split the text into tokens (i.e. words).

  2. Remove English stopwords (the, and, etc.) to reduce noise in the data.

  3. Remove infrequent words, i.e. those appearing in fewer than 5 documents of the corpus (via the min_df parameter).

Reference: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction

vectorizer = CountVectorizer(min_df=5, stop_words='english')
word_doc_matrix = vectorizer.fit_transform(dataset.data)

Question: How many words are there in our final vocabulary?

word_doc_matrix.shape
(18846, 34880)
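Note that min_df filters on document frequency: with an integer value, a word must appear in at least that many different documents to be kept, regardless of how often it repeats within one document. A toy sketch:

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["apple banana apple", "banana cherry", "banana durian"]

# Keep every word...
full = CountVectorizer().fit(docs)

# ...versus dropping words that appear in fewer than 2 documents
filtered = CountVectorizer(min_df=2).fit(docs)

print(sorted(full.vocabulary_))      # ['apple', 'banana', 'cherry', 'durian']
print(sorted(filtered.vocabulary_))  # ['banana']
```

"apple" occurs twice overall but only in one document, so min_df=2 removes it.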

Now we are going to use UMAP to reduce the matrix from 34880 dimensions down to 2 (since n_components=2). UMAP needs a distance metric, and we will use Hellinger distance, which measures the similarity between two probability distributions. Each document's word counts can be viewed as a sample from a multinomial distribution, so Hellinger distance is a natural way to compare documents.

TLDR: Hellinger distance is a good distance metric for comparing texts represented as bags of words.
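A minimal sketch of Hellinger distance with numpy: normalise two documents' word-count vectors into probability distributions, then compare them (the toy count vectors here are illustrative, not from the dataset):

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance between two probability distributions."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Toy word-count vectors for two documents over a shared vocabulary
counts_a = np.array([3, 1, 0, 2])
counts_b = np.array([0, 1, 4, 1])

# Normalise counts into probability distributions before comparing
p = counts_a / counts_a.sum()
q = counts_b / counts_b.sum()

print(hellinger(p, p))  # identical distributions -> 0.0
print(hellinger(p, q))  # strictly between 0 and 1 (distributions overlap partially)
```

The distance is 0 for identical distributions and 1 for distributions with disjoint support, which makes it well suited to sparse count data.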

embedding = umap.UMAP(n_components=2, metric='hellinger').fit(word_doc_matrix)

Let’s plot the embedding:

f = umap.plot.interactive(embedding, labels=dataset.target, hover_data=hover_df, point_size=3)
show(f)

As you can see, this works reasonably well: there is some separation, and groups that you would expect to be similar (such as 'rec.sport.baseball' and 'rec.sport.hockey') end up close together. The big clump in the middle corresponds to several very similar newsgroups, like 'comp.sys.ibm.pc.hardware' and 'comp.sys.mac.hardware'.

Applications#

Now that we have an embedding, there are several things we can do with it:

  • Explore/visualize your corpus to identify topics/trends

  • Cluster the embedding to find groups of related documents

  • Look for nearest neighbours to find related documents

  • Look for anomalous documents
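As a sketch of the clustering and nearest-neighbour ideas, using scikit-learn on synthetic 2-d points (a stand-in for the real coordinates, which a fitted UMAP model exposes as embedding.embedding_):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(42)
# Two well-separated synthetic blobs standing in for embedded documents
points = np.vstack([
    rng.normal(loc=(0, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5, 5), scale=0.3, size=(50, 2)),
])

# Cluster the embedded points into groups of related documents
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# Find the 3 nearest neighbours of the first document (the closest is itself)
nn = NearestNeighbors(n_neighbors=3).fit(points)
distances, indices = nn.kneighbors(points[:1])
print(labels[:5], indices)
```

On real data you would pass `embedding.embedding_` in place of `points`; density-based algorithms such as HDBSCAN are also a popular choice for clustering UMAP output.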

Exercise 1. DisneyLand Reviews#


Use the DisneyLand reviews dataset (CSV file available on Blackboard) to create a UMAP embedding of the text of the reviews and visualize it.

  • As the color of the plot, you can use the review rating.

  • You may need to modify the hover_df structure to include the Review_Text, so you can read the text of the review when hovering over the plot.

Can you see any patterns in the plot?

Exercise 2. Philosophy Texts#


Use the Philosophy text collection (.zip file available on Blackboard) to create a UMAP embedding of the text of the documents and visualize it.

  • As the color of the plot, you can use the author, which is the first word of the filename.

  • You may need to modify the hover_df structure to include the filename, so you can read the name of the book when hovering over the plot.

Can you identify schools of philosophy in the plot?

import glob
import os

results = []

list_of_files = glob.glob("./phil_txts/*.txt")

for file in list_of_files:

    with open(file, mode="r", encoding="iso-8859-1") as f:
        text = f.read()

    # The author is the first word of the filename
    name = os.path.basename(file).split('_')[0]

    results.append(
        {"text": text,
         "author": name
        }
    )

df = pd.DataFrame(results)
# Replace newlines with spaces so each document is a single line of text
df['text'] = df['text'].str.replace('\n', ' ')